Abstract



Data-flow is a natural approach to parallelism. How- 
ever, describing dependencies and control between fine- 
grained data-flow tasks can be complex and present 
unwanted overheads. TALM (TALM is an Architec- 
ture and Language for Multi-threading) introduces a 
user-defined coarse-grained parallel data-flow model, 
where programmers identify code blocks, called super- 
instructions, to be run in parallel and connect them 
in a data-flow graph. TALM has been implemented 
as a hybrid Von Neumann/data-flow execution system:
the Trebuchet. We have observed that TALM's useful- 
ness largely depends on how programmers specify and 
connect super-instructions. Thus, we present Couil- 
lard, a full compiler that creates, based on an anno- 
tated C-program, a data-flow graph and C-code cor- 
responding to each super-instruction. We show that 
our toolchain allows one to benefit from data-flow ex- 
ecution and explore sophisticated parallel programming 
techniques, with small effort. To evaluate our system 
we have executed a set of real applications on a large 
multi-core machine. Comparison with popular paral- 
lel programming methods shows competitive speedups, 
while providing an easier parallel programming approach.



1. Introduction 

Data-flow programming provides a natural approach 
to parallelism, where instructions execute as soon as 
their input operands are available [HI |2T1 [TH US] . Ac- 
tually in dynamic data-flow, we may even have inde- 
pendent instructions from multiple iterations on a loop 
running simultaneously, as parts of the loop may run 
faster than others and reach the next iterations. Therefore,
it is complex to describe control in data-flow,
since instructions must only proceed to execution when 
operands from the same iteration match. However, this 
difficulty is compensated by the amount of parallelism 
exploited this way.

TALM (TALM is an Architecture and Language for 
Multi-threading) [HI [21 H] is an execution model de- 
signed to exploit the advantages of data-flow in multi- 
thread programming. A program in TALM is com- 
prised of code blocks called super-instructions and sim- 
ple instructions connected in a graph according to their 
data dependencies (i.e., a data-flow graph). To parallelize
a program using the TALM model, the programmer
marks portions of code that are to become super-instructions
and describes their dependencies. With this
approach, parallelism comes naturally from data-flow 
execution. 

The major advantage of TALM is that it provides 
a coarse-grained parallel model that can take advan- 
tage of data-flow. It is also a very flexible model, as 
the main data-flow instructions are available, thus allowing
full compilation of control in a data-flow fashion.
This gives the programmer the latitude to choose
from coarser to more fine-grained execution strategies. 
This approach contrasts with previous work in data- 
flow programming [TU [5T1 [53], which often aimed at 
hiding data-flow execution from the programmer. 

A first implementation of TALM, the Trebuchet 
system, has been developed as a hybrid Von 
Neumann/data-flow execution system for thread-based
architectures in shared memory platforms. Trebuchet 
emulates a data-flow machine that supports both 
simple instructions and super-instructions. Super- 
instructions are compiled as separate functions that 
are called by the runtime environment, while regular 
instructions are interpreted upon execution. Although 
Trebuchet needs to emulate data-flow instructions, ex- 
perience showed most running time is within our super- 
instructions. Initial results show the parallel engine 
to be competitive with state-of-the-art parallel appli- 
cations using OpenMP, both in terms of base perfor- 
mance, and in terms of speedups fT9l [H |3]. On the 
other hand, parallelism for simple SPMD (Single
Program, Multiple Data) applications can be explored
quite well with tools such as OpenMP. The main bene- 
fits exploited by TALM become apparent when exper- 
imenting with applications that require more complex 
techniques, such as software pipelining or speculative 
execution. 

The usefulness of TALM clearly depends on how the 
programmer can specify and connect super-instructions 
together, including the complex task of describing con- 
trol using data-flow instructions. We therefore in- 
troduce Couillard, a C-compiler designed to compile 
TALM annotated C-programs into a data-flow graph, 
including the description of program control using dy- 
namic data-flow. Couillard is designed to insulate the 
programmer from the details of data-flow program- 
ming. By requiring the programmer to just anno- 
tate the code with the super-instruction definitions 
and their dependencies, Couillard greatly simplifies the 
task of parallelizing applications with TALM. 

This work makes two contributions: 

• We define the TALM language, as an extension 
of ANSI C and present a full implementation of 
the Couillard compiler, which generates data-flow
graphs and super-instruction code for TALM. 

• We evaluate the performance of Couillard on two 
state-of-the-art PARSEC [6] benchmarks. We 
demonstrate that Trebuchet and Couillard allow
one to explore complex parallel programming techniques,
such as non-linear software pipelines and
hiding I/O latency. Comparison with popular parallel
programming models, such as Pthreads [8],
OpenMP [g and Intel Threading Building Blocks [21]
shows that our approach is not just competitive
with state-of-the-art technology, but can in fact
achieve better speedups by allowing one to easily
exploit a sophisticated design space for parallel
programs.

The paper is organized as follows. In Sect. 2 we
briefly review the TALM architecture and its implementation,
the Trebuchet. In Sect. 3 we describe the TALM
language and the Couillard implementation. In Sect. 4
we present performance results on the two PARSEC
benchmarks. In Sect. 5 we discuss related work.
Last, we present our conclusions and discuss future
work.

2. TALM and Trebuchet 

TALM [19] [H [3] allows application developers to 
take advantage of the possibilities available in the data- 
flow model in current Von Neumann architectures, in 
order to explore thread-level parallelism (TLP) in a more flexible way. The TALM
ISA sees applications in the form of a data-flow graph
that can be run in parallel.

A main contribution of TALM is that it enables programmers
to introduce user-defined instructions, the so-called
super-instructions. TALM assumes a contract
with the programmer, whereby she or he guarantees
that execution of the super-instruction can start once all
inputs are available, and that output arguments are
made available as soon as possible,
but not sooner. Otherwise, TALM has no information 
on the semantics of individual super-instructions, and 
indeed imposes no restrictions. Thus, a programmer 
can use shared memory in super-instructions without 
having to inform TALM. Although this requires ex- 
tra care from the programmer, the advantage is that 
TALM allows easy porting of imperative programs and 
easily allows program refinement. 

TALM has been implemented for multi-cores as a 
hybrid Von Neumann/data-flow execution system: the 
Trebuchet. Trebuchet is in fact a data-flow virtual ma- 
chine that has a set of data-flow processing elements 
(PEs) connected in a virtual network. Each PE is as- 
sociated with a thread at the host (Von Neumann) ma- 
chine. When a program is executed on Trebuchet, in- 
structions are loaded into the individual PEs and fired 
according to the data-flow model. Independent instructions
will run in parallel if they are mapped to
different PEs and there are available cores at the host 
machine to run those PEs' threads simultaneously. 

Trebuchet is a Posix-threads based implementation 
of TALM. It loads super-instructions as a dynamically
linked library. At run-time, execution of super-instructions
is fired by the virtual machine, according
to the data-flow model, but their interpretation comes 
down to a procedure call resulting in the direct execu- 
tion of the related block. 
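
The firing rule and the procedure-call dispatch described above can be sketched in C as follows. This is a minimal, single-threaded illustration and not Trebuchet's actual code: the names df_node, df_send and add_super are hypothetical, operands are plain integers, and the iteration tags needed by dynamic data-flow are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INPUTS 4

/* Hypothetical signature of a super-instruction body: a plain C function
 * that the runtime calls once all of its inputs have arrived. */
typedef int (*super_fn)(const int *inputs, size_t n_inputs);

/* Hypothetical node of the data-flow graph. */
typedef struct {
    super_fn fn;                  /* body of the super-instruction       */
    size_t   n_inputs;            /* number of operands required to fire */
    size_t   n_present;           /* operands received so far            */
    int      present[MAX_INPUTS]; /* 1 if the port already has a value   */
    int      inputs[MAX_INPUTS];  /* operand values, one per port        */
    int      fired;               /* 1 after the node has executed       */
    int      result;              /* output operand produced by fn       */
} df_node;

/* Deliver one operand to input port `port`; fire the node when the last
 * missing operand arrives. Returns 1 if the node fired on this call. */
static int df_send(df_node *node, size_t port, int value)
{
    assert(node->n_inputs <= MAX_INPUTS && port < node->n_inputs);
    if (!node->present[port]) {
        node->present[port] = 1;
        node->n_present++;
    }
    node->inputs[port] = value;
    if (!node->fired && node->n_present == node->n_inputs) {
        /* "Interpretation" of a super-instruction is a procedure call. */
        node->result = node->fn(node->inputs, node->n_inputs);
        node->fired = 1;
        return 1;
    }
    return 0;
}

/* Example super-instruction body: sums its operands. */
static int add_super(const int *inputs, size_t n_inputs)
{
    int sum = 0;
    for (size_t i = 0; i < n_inputs; i++)
        sum += inputs[i];
    return sum;
}
```

A node initialized with fn set to add_super and n_inputs set to 2 stays idle after the first df_send call and executes on the second, whichever port is filled first.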
Figure 1. Work-flow to follow when writing
parallel applications with Trebuchet. 



Trebuchet may either rely solely on static schedul- 
ing of instructions among PEs or may also use work- 
stealing as a tool against imbalance. The work-stealing 
algorithm employed by Trebuchet is based on the ABP 
algorithm [J], the main difference being that the algo- 
rithm developed for Trebuchet provides a FIFO double- 
ended queue (deque) instead of a LIFO one, as is the 
case for the ABP algorithm. The FIFO order is cho- 
sen so that older instructions have execution priority, 
which is desirable for the applications we target at this 
moment. 
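
The difference in removal order can be illustrated with a small ready queue. The sketch below is an assumption-laden simplification: it is single-threaded, uses hypothetical names (ready_queue, rq_take_oldest, rq_take_newest), and models none of the concurrency control of the ABP algorithm or of Trebuchet's variant. It only shows which instruction is picked next under each policy.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64

/* Hypothetical ready queue of instruction ids, kept in arrival order. */
typedef struct {
    int    items[QUEUE_CAP];
    size_t head;   /* index of the oldest entry */
    size_t count;  /* number of queued entries  */
} ready_queue;

static void rq_push(ready_queue *q, int instr_id)
{
    assert(q->count < QUEUE_CAP);
    q->items[(q->head + q->count) % QUEUE_CAP] = instr_id;
    q->count++;
}

/* FIFO removal: the oldest ready instruction gets execution priority. */
static int rq_take_oldest(ready_queue *q)
{
    int id;
    assert(q->count > 0);
    id = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return id;
}

/* LIFO removal, shown only for contrast: the newest entry runs first. */
static int rq_take_newest(ready_queue *q)
{
    assert(q->count > 0);
    q->count--;
    return q->items[(q->head + q->count) % QUEUE_CAP];
}
```

With instructions 10, 11 and 12 queued in that order, rq_take_oldest returns 10 while rq_take_newest would return 12.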

Figure 1 shows the work-flow to be followed in order
to parallelize a sequential program and execute it
on Trebuchet. Initially, blocks that will form super-instructions
are defined. Then, a super-instruction
code extraction is performed to transform all blocks 
into functions that will collect input operands from 
Trebuchet, process and return output operands. Pro- 
filing tools may be used in helping to determine which 
portions of code are interesting candidates for paral- 
lelization. 

In the next step, the transformed blocks are com- 
piled into a dynamic library, which will be available 
to the abstract machine interpreter. Then, a data-flow
graph connecting all blocks is defined and the data-flow
assembly code is generated. The code may have
both super-instructions and simple (fine-grained) in- 
structions. TALM provides all the standard data and 
control instructions that one would expect in a dynamic 
data-flow machine. 

Last, a data-flow binary is generated from the as- 
sembly, processor placement is defined, and the binary 
code is loaded and executed. As said above, execu- 
tion of simple instructions requires full interpretation, 
whereas super-instructions are directly executed on the 
host machine. 

In [21 [5] TALM was used to parallelize a set of 7 ap- 
plications: a matrix determinant calculation, a matrix 
multiplication application, a ray tracing application, 
Equake from SpecOMP 2001, IS from NPB3.0-OMP, 
and also LU and Mandelbrot from the OpenMP Source 
code Repository JTj. The achieved speedups for 8 
threads, in relation to the sequential versions were, re- 
spectively 2.52, 4.16, 4.39, 3.61, 3.00, 2.19 and 7.16. 
On the other hand, OpenMP versions of those bench- 
marks have provided speedups of 1.96, 4.15, 4.39, 3.40, 
3.11, 2.19 and 7.13. These results are very promising, 
and show that Trebuchet can be very competitive with 
OpenMP for regular applications. 

Trebuchet provides a natural platform for exper- 
imenting with advanced parallel programming tech- 
niques. In [19] a thread-level speculation model based 
on optimistic transactions with ordered commits was 
created for TALM and implemented in Trebuchet. Ex- 
ecution of speculative instructions is done within trans- 
actions, each one formed by one speculative instruction 
and its related Commit instruction. Transactions will 
have access only to local copies of the used resources. 
Once they finish running, if no conflicts are found, lo- 
cal changes will be persisted to global state by Commit 
instructions, associated with each speculative instruction.
In case conflicts are found in a speculative instruction
I, local changes will be discarded and I will
have to be re-executed.
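
The execute, commit and re-execute cycle can be sketched as below. The text does not say how conflicts are detected, so the version counter used here is purely an assumption, as are all names (shared_cell, transaction, tx_begin, tx_commit, speculative_deposit). The sketch is sequential and does not model ordered commits.

```c
#include <assert.h>

/* Hypothetical shared resource. The version counter is an assumption
 * made for this sketch in order to detect conflicting commits. */
typedef struct {
    int      value;
    unsigned version;
} shared_cell;

typedef struct {
    int      local_value;   /* private copy used during execution */
    unsigned read_version;  /* version observed when copying      */
} transaction;

/* Start a transaction by taking a local copy of the resource. */
static void tx_begin(transaction *tx, const shared_cell *cell)
{
    tx->local_value  = cell->value;
    tx->read_version = cell->version;
}

/* Commit step: persist the local change only if nobody else committed
 * since tx_begin. Returns 1 on success, 0 if the work must be redone. */
static int tx_commit(const transaction *tx, shared_cell *cell)
{
    if (cell->version != tx->read_version)
        return 0;               /* conflict: local changes are discarded */
    cell->value = tx->local_value;
    cell->version++;
    return 1;
}

/* Run a speculative deposit, re-executing until the commit succeeds.
 * Returns the number of executions that were needed. */
static int speculative_deposit(shared_cell *account, int amount)
{
    int attempts = 0;
    transaction tx;
    do {
        tx_begin(&tx, account);
        tx.local_value += amount;  /* body of the speculative instruction */
        attempts++;
    } while (!tx_commit(&tx, account));
    return attempts;
}
```

If two transactions copy the same state and both try to commit, the first succeeds and the second is rejected, so its instruction has to run again on fresh data.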

Using speculative execution liberates the program- 
mer to consider only explicit dependencies while guar- 
anteeing correct execution of coarse-grained tasks. 
Moreover, the speculation mechanism does not demand 
centralized control, which is a key feature for upcom- 
ing many-core systems, where scalability has become 
an important concern. To evaluate the speculation system,
an artificial bank server simulator application was
implemented to simulate scenarios varying computation
load, transaction size, speculation depth, and contention.
Results of executing this application with
up to 24 threads on a 24-core machine suggest that
there is a wide range of situations where speculation
can be very effective and indeed achieve speedups close
to the ideal case.

3. Compilation 

The data-flow model exposes thread-level paral- 
lelism by taking advantage of how data is exchanged 
between processing elements. In this vein, program- 
ming in TALM is about identifying parallel tasks and 
how data is consumed and produced between them. 
The initial Trebuchet implementation provided an ex- 
ecution environment for multi-cores, plus an assembler 
and loader. It was up to the programmer to code super-instructions
in the library and to write TALM assembly
code linking the different instructions together and
specifying control through data-flow instructions, which is not always
a trivial task.

In this work we propose Couillard, a C-compiler for 
data-flow style execution. With Couillard, the programmer
annotates blocks of code that are going to
become super-instructions, and further annotates the 
program variables that correspond to their inputs and 
outputs. Couillard then produces the C-code corre- 
sponding to each super-instruction to be next compiled 
as a shared object to the target architecture and loaded 
by Trebuchet. Moreover, Couillard generates TALM 
assembly code to connect all super-instructions accord- 
ing to the user's specification. This assembly code 
represents the actual data-flow graph of the program. 
In addition, control constructs such as loops and if-then-else
statements that are not within a super-instruction
will also be compiled to TALM assembly code. This
assembly code will then be used by Trebuchet to guide 
execution, following the data-flow rules. 

The Couillard front-end uses PLY (Python Lex-Yacc) [5]
and a grammar that is a subset of ANSI-C extended
with super-instruction constructs. The Couillard back-end
generates TALM assembly code, super-instruction
C-code (to be compiled into a dynamically
linked library) and a graph representation of the program,
using Graphviz notation [1].

3.1. Front-end 

We assume that super-instructions take most of the 
running time of an application, as regular instruc- 
tions are mostly used to describe the data and con- 
trol relations between super-instructions. Since super- 
instruction code will be compiled using a regular C- 
compiler and regular instructions tend to be simple, 
Couillard does not need to support the full ANSI-C 
grammar. 



Couillard therefore adopts a subset of the ANSI-C
grammar extended to support data-flow directives
relative to super-instructions and their dependencies.
We have also changed the syntax of variable declaration
and access, which is necessary to parallelize super-instructions.
The compiler front-end produces an AST
(Abstract Syntax Tree) that will be processed to generate
a data-flow graph representation.

3.1.1 Blocks and Super-Instructions 

The annotation pair #BEGINBLOCK and #ENDBLOCK is
used to mark blocks of code that will not be compiled
to data-flow. Those blocks usually contain include files,
auxiliary function definitions, and global variable declarations,
to be used by super-instruction code in the
dynamic library.

Super-instruction annotation is performed according 
to the following syntax: 

treb_super <single | parallel> input(<inputs_list>)
                               output(<output_list>)
#BEGINSUPER
#ENDSUPER

Super-instructions declared as single will always
have only one instance in the data-flow graph, while
instructions declared as parallel may have multiple
instances that can run in parallel, depending on the
placement and availability of resources at the host machine.
In the example of Fig. 3 (described in more detail
in Section 3.4), we have single super-instructions
at the beginning and end of the computation. In contrast,
the inner code corresponds to parallel super-instructions.

3.1.2 Variables 

Couillard requires the programmer to specify how vari- 
ables connect the different super-instructions. More 
precisely, all variables used as inputs or outputs of 
super-instructions must be previously declared to guar- 
antee that data will be exchanged correctly between 
instructions (without loss due to wrong type cast- 
ings). Also, output variables used on parallel super- 
instructions must be declared as follows: 

treb_parout <type> <identifier>;

The storage classifier treb_parout is used because
parallel super-instructions, in general, have multiple instances.
Therefore, output variables of parallel super-instructions
will also have multiple instances, one for
each instance of the parallel super-instruction.



When using a treb_parout variable as input to an- 
other super-instruction (or even in external C-code) it 
is necessary to specify the instance that is being ref- 
erenced. To do so, Couillard provides the following 
syntax: 

<identifier>::<NUMBER |
               * |
               mytid |
               (mytid + NUMBER) |
               (mytid - NUMBER) |
               lasttid>

Consider a variable named x. The notation x::0
refers to instance 0 of variable x, while x::* refers
to all instances of this variable (this provides a useful
abstraction when a super-instruction can receive input
from a number of sources). Also, it is often convenient
to refer to the instance for the current (parallel) super-instruction.
If x is used as input to another parallel
super-instruction, we can select x through the expression
x::mytid. To illustrate this situation, in the example
of Fig. 3 each instance k (0 ≤ k ≤ 1, since there
are 2 instances of each parallel super-instruction) of
Proc-2A receives as input c::k, produced by Proc-1.
Expressions with + and - are also allowed with mytid.
For example, if a parallel super-instruction X produces
operand a and another parallel super-instruction Y
specifies a::(mytid - 1) as input, it means that
for a task i, Y.i will receive a from X.(i - 1). Last,
the reserved word lasttid refers to the last instance of
a parallel super-instruction and can be used to specify
inputs to parallel and single super-instructions.
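
The mapping from an instance selector to a producer instance can be written down as a small function. The sketch below is illustrative only: the enum, the struct and resolve_instance are hypothetical names, the "all instances" selector is not covered, and returning -1 for a non-existent instance is an assumption, since the text does not state how out-of-range selectors are handled.

```c
#include <assert.h>

/* Hypothetical encoding of the selector written after "::". */
typedef enum {
    SEL_NUMBER,     /* x::NUMBER                                  */
    SEL_MYTID,      /* x::mytid                                   */
    SEL_MYTID_OFF,  /* x::(mytid + NUMBER) or x::(mytid - NUMBER) */
    SEL_LASTTID     /* x::lasttid                                 */
} selector_kind;

typedef struct {
    selector_kind kind;
    int           number;  /* NUMBER, or signed offset for SEL_MYTID_OFF */
} selector;

/* Returns the producer instance read by consumer instance `mytid`, or -1
 * when the selected instance does not exist. `n_tasks` is the number of
 * instances of the producing parallel super-instruction. */
static int resolve_instance(selector sel, int mytid, int n_tasks)
{
    int target;
    assert(n_tasks > 0 && mytid >= 0);
    switch (sel.kind) {
    case SEL_NUMBER:    target = sel.number;         break;
    case SEL_MYTID:     target = mytid;              break;
    case SEL_MYTID_OFF: target = mytid + sel.number; break;
    case SEL_LASTTID:   target = n_tasks - 1;        break;
    default:            return -1;
    }
    return (target >= 0 && target < n_tasks) ? target : -1;
}
```

For the a::(mytid - 1) example above, the selector {SEL_MYTID_OFF, -1} evaluated for task 3 yields producer instance 2, matching "Y.i receives a from X.(i - 1)".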

For the cases where there are dependencies between
instances of the same parallel super-instruction we can
specify input variables using the following construct:

local.<identifier>::<(mytid + NUMBER) |
                     (mytid - NUMBER)>

For example, if we state that a parallel super-instruction
s produces operand o and receives local.o::(mytid - 2),
it means that s.i (instance i of s) depends
on s.(i - 2). Moreover, it means that s.0 and s.1 do not
have local dependencies. We can also specify operands
that will be sent only to those independent instances
of s. We use the following syntax:

starter.<identifier>::<NUMBER |
                       * |
                       mytid |
                       (mytid + NUMBER) |
                       (mytid - NUMBER) |
                       lasttid>



In the previous example, if we also define starter.c as
an input of s, only s.0 and s.1 will receive this operand.
A practical use of these constructs is to serialize
distributed I/O operations to hide I/O latency,
as explained in Section 3.4.
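
The relation between local dependencies and starter operands can be checked with two small helper functions. They are hypothetical, written only for illustration, and cover just the (mytid - NUMBER) form with NUMBER passed as `distance`.

```c
#include <assert.h>

/* For an input of the form local.o::(mytid - distance), instance `tid`
 * waits for instance tid - distance of the same super-instruction.
 * Returns that predecessor, or -1 when there is none. */
static int local_predecessor(int tid, int distance)
{
    assert(tid >= 0 && distance > 0);
    return (tid - distance >= 0) ? tid - distance : -1;
}

/* Instances without a local predecessor are the independent ones, i.e.
 * the instances that receive the starter operand. */
static int receives_starter(int tid, int distance)
{
    return local_predecessor(tid, distance) < 0;
}
```

With a distance of 2, instances 0 and 1 receive the starter operand, while instance 5 waits for instance 3.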

The rationale to describe parallel code in super-instructions
is simple. The developer first divides the
code into blocks that can be run in parallel. Initialization
and termination blocks will most often be single,
whereas most of the parallel work will be in parallel 
blocks. The programmer next specifies how the blocks 
communicate. If the communication is purely control- 
based the programmer should further add an extra 
variable to specify this connection (a common tech- 
nique in parallel programming). Note that the pro- 
grammer still has to prevent data races between blocks 
unless speculative execution is used (which is not yet 
supported by the compiler). 

3.2. Back-end 

After generating an Abstract Syntax Tree (AST) of 
a program, Couillard produces its corresponding data-flow
graph. From this graph, it generates three output
files: 

1. A .dot file describing the graph in the Graphviz
[1] notation. This file will be used to create an image
of that graph, using the Graphviz toolchain.
Although a Graphviz graph is not needed by Trebuchet,
it may be useful for academic purposes or
to provide a more intelligible view of the produced
graph to programmers who want to perform
manual adjustments to their applications.

2. A .fl file describing the graph using TALM's ISA.
This file will be the input to Trebuchet's assembler,
producing the .flb binary file that will be
loaded into Trebuchet's virtual machine.

3. A .lib.c file describing the super-instructions as
functions, in C-code, to be compiled as a dynamically
linked library, using any regular C-compiler.
All input and output variables described with
Couillard syntax are automatically declared and
initialized within the generated function. Notice
also that the super-instruction body does not
need to be parsed by Couillard. It is just treated
as the value of a super-instruction node in the
AST representation. This allowed us to focus only
on the instructions necessary to connect super-instructions
in a coarse-grained data-flow graph.



3.3. Auxiliary Functions and Command Line Arguments



The functions treb_get_tid() and
treb_get_n_tasks() have been added to the Trebuchet
virtual machine and can be called inside
super-instruction code. The former returns the thread
id of that super-instruction's instance, while the latter
returns the number of threads. Those functions can
be used to identify the portion of work to be done by
each instance.
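
One way to turn the two values into a portion of work is a block partition. The sketch below is an assumption rather than code from Trebuchet: since the exact signatures of treb_get_tid() and treb_get_n_tasks() are not given in the text, their results are passed in as plain int parameters, and work_range and chunk_for are hypothetical names.

```c
#include <assert.h>

/* Half-open range [begin, end) of elements handled by one instance. */
typedef struct {
    int begin;
    int end;
} work_range;

/* Split `size` elements among `n_tasks` instances as evenly as possible;
 * the first (size % n_tasks) instances get one extra element. */
static work_range chunk_for(int tid, int n_tasks, int size)
{
    work_range r;
    int base, extra;
    assert(n_tasks > 0 && tid >= 0 && tid < n_tasks && size >= 0);
    base  = size / n_tasks;
    extra = size % n_tasks;
    r.begin = tid * base + (tid < extra ? tid : extra);
    r.end   = r.begin + base + (tid < extra ? 1 : 0);
    return r;
}
```

Inside a parallel super-instruction, one would pass the values returned by the two Trebuchet functions as `tid` and `n_tasks`; with 300 elements and 4 instances, instance 3 handles elements 225 to 299.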



(A)

#BEGINBLOCK //includes, functions and globals
#define size 300
#ENDBLOCK

int main(){
  int a=0, e; treb_parout int b,c,d;

  treb_super single input(a) output(a)
  #BEGINSUPER //INIT. CODE
  #ENDSUPER

  treb_super parallel input(starter.a,
                            local.b::(mytid-1)) output(b)
  #BEGINSUPER /*Reads (size/n_tasks) elements
                in each instance*/
  #ENDSUPER

  treb_super parallel input(b::mytid) output(c)
  #BEGINSUPER //PROC. CODE
  #ENDSUPER

  treb_super parallel input(c::mytid,
                            local.d::(mytid-1)) output(d)
  #BEGINSUPER /*Writes (size/n_tasks) elements
                in each instance*/
  #ENDSUPER

  treb_super single input(d::lasttid) output(e)
  #BEGINSUPER //END CODE
  #ENDSUPER

  return 0;
}




Figure 2. Example of how to hide I/O latency 
with TALM. 



In our system, applications are executed within the 
Trebuchet virtual machine. Therefore, command line 
argument variables cannot be declared within the application's
code. They need to be passed through Trebuchet's
command line. Thus, Trebuchet stores a vector
of command line arguments and the number of arguments
in the treb_superargv and treb_superargc variables,
respectively. Then, Couillard declares those
variables as extern when generating the .lib.c file, 
meaning that programmers can access those arguments 
within super-instructions' body. 
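
Access to these arguments from a super-instruction body might look like the sketch below. The types of treb_superargc and treb_superargv are not stated in the text, so the declarations here are an assumption modelled on the argc and argv parameters of main. The variables are defined locally only to keep the example self-contained, and arg_as_int is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the variables that Trebuchet provides. Their types are an
 * assumption; in generated code they would be declared extern and defined
 * by the virtual machine, not in this file. */
int    treb_superargc = 0;
char **treb_superargv = NULL;

/* Hypothetical helper for a super-instruction body: reads argument
 * `index` as an integer, falling back to `fallback` when it is absent. */
static int arg_as_int(int index, int fallback)
{
    assert(index >= 0);
    if (treb_superargv == NULL || index >= treb_superargc)
        return fallback;
    return atoi(treb_superargv[index]);
}
```

A super-instruction could use such a helper to read, for instance, the number of elements to process instead of relying on a compile-time constant.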

3.4. Illustrative Examples 

Figure 2 provides an example of how the TALM high-level
language is used to hide I/O latency in a parallel
application. In this example we assume that 300 elements
need to be read from a file, processed and then
the result must be written to an output file. In pane A
we can see the different steps to be performed by super-instructions
(inner code not shown): (i) initialization of
variables and FILE pointers, (ii) reading, (iii) processing,
(iv) writing and (v) closing of files. Pane B shows
the associated data-flow graph, generated by Couillard.



(A)

int main(){
  int a=0,b,f=0,cond=1,cond2;
  treb_parout int c,d,e;

  treb_super single input(a) output(a)
  #BEGINSUPER
  ... //INIT. CODE
  #ENDSUPER

  while(cond>0) {
    treb_super single input(a)
                      output(b,cond2,cond)
    #BEGINSUPER
    ... //INPUT/COND. CODE
    #ENDSUPER

    treb_super parallel input(b)
                        output(c)
    #BEGINSUPER
    ... //PROC. 1 CODE
    #ENDSUPER

    if(cond2>0){
      treb_super parallel
                 input(c::mytid) output(d)
      #BEGINSUPER
      ... //PROC. 2A CODE
      #ENDSUPER
    }
    else{
      treb_super parallel
                 input(c::mytid) output(d)
      #BEGINSUPER
      ... //PROC. 2B CODE
      #ENDSUPER
    }

    treb_super parallel
               input(d::mytid) output(e)
    #BEGINSUPER
    ... //PROC. 3 CODE
    #ENDSUPER

    treb_super single input(f,e::*)
                      output(f)
    #BEGINSUPER
    ... //OUTPUT CODE
    #ENDSUPER
  }

  treb_super single input(f)
                    output(void)
  #BEGINSUPER
  ... //END CODE
  #ENDSUPER
  return 0;
}




Figure 3. Example of non-linear parallel 
pipeline with TALM. 



One can notice that reading and writing stages are 
described as parallel super-instructions, but since there 
are local inputs, they will be executed serially (al- 
though spread among different PEs). This construct 
allows the execution of each processing task to start as 
soon as the corresponding read operation has finished,
instead of waiting for the whole read. It also allows writing
the results of each processing task i without having
to wait for the whole processing stage to finish.
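
The benefit of this construct can be estimated with a small timing model. The sketch below is a back-of-the-envelope illustration under stated assumptions: the costs per chunk are hypothetical units, not measurements; reads and writes are each serialized as in Figure 2; every processing task is assumed to find a free core; and pipelined_time and phased_time are names introduced here.

```c
#include <assert.h>

/* Completion time when each chunk is processed as soon as it has been
 * read, with reads and writes each serialized through local inputs.
 * Assumes one free core per processing task. */
static int pipelined_time(int n_chunks, int t_read, int t_proc, int t_write)
{
    int read_done = 0, write_done = 0;
    assert(n_chunks > 0);
    for (int i = 0; i < n_chunks; i++) {
        int proc_done, start;
        read_done += t_read;               /* read i follows read i-1      */
        proc_done = read_done + t_proc;    /* starts right after read i    */
        start = proc_done > write_done ? proc_done : write_done;
        write_done = start + t_write;      /* write i follows write i-1    */
    }
    return write_done;
}

/* Completion time when all reads finish before any processing starts and
 * all processing finishes before any write starts. */
static int phased_time(int n_chunks, int t_read, int t_proc, int t_write)
{
    assert(n_chunks > 0);
    return n_chunks * t_read + t_proc + n_chunks * t_write;
}
```

For 4 chunks with a read cost of 2, a processing cost of 10 and a write cost of 2, the model gives 20 time units for the pipelined version against 26 for the phased one, because most reads and writes overlap with processing.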

Figure 3 provides an example of how to use the TALM
high-level language to describe a non-linear parallel
pipeline. The example is a skeleton code of an application
that reads a file containing a bag of tasks to
be processed and writes the results to another file. The
processing phase can be divided into 3 stages (Proc-1,
Proc-2 and Proc-3). The processing task Proc-2 was
divided into two different tasks (Proc-2A and Proc-2B),
which are executed conditionally. Figure 3 (pane A)
shows the TALM annotations, while the corresponding
data-flow graph for 2 threads, generated by the Couillard
compiler, is shown in Fig. 3 (pane B).

4. Experiments and Results 

Our goal is to obtain good performance in real applications
and evaluate TALM for complex parallel
programming. We study how our model performs on
two state-of-the-art benchmarks from the PARSEC [6]
suite: Blackscholes and Ferret. The experiments were
executed 5 times in order to remove discrepancies in the
execution time. We used as parallel platform a machine
with four AMD Six-Core Opteron 8425 HE (2100
MHz) chips (24 cores) and 64 GB of DDR-2 667MHz
(16x4GB) RAM, running GNU/Linux (kernel 2.6.31.5-127,
64 bits). The machine was running in multi-user
mode, but no other users were on the machine.
Figure 4. Blackscholes results.

We started our study with a regular application: 
Blackscholes. It calculates the prices for a portfolio of 
European options analytically with the Black-Scholes 
partial differential equation (PDE). There is no closed- 
form expression for the Black-Scholes equation, and as 
such it must be computed numerically. The application
reads a file containing the portfolio. The Black-Scholes
partial differential equation for each option in the portfolio
can be calculated independently. The application
is parallelized with multiple instances of the process- 
ing thread that will be responsible for a group of op- 
tions. Results are then written sequentially to an out- 
put file. The PARSEC suite already comes with 3 par- 
allel versions of the Blackscholes benchmark: OpenMP, 
Pthreads and TBB. We have produced a Trebuchet version
of Blackscholes, following the same patterns present
in the PARSEC versions to exploit parallelism. However,
we observed that we can hide I/O latency and
increase memory locality if we have multiple instances
of the input and output threads. Thus, we have also
implemented Blackscholes according to the example
shown in Section 3.4, Figure 2.

Figure 4 shows the results obtained for the Blackscholes
benchmark. Using the TALM language, it is possible
to obtain good performance (comparable to Pthreads 
implementations) in a simple fashion. However, the 
flexibility of the language enables the programmer to 
achieve even greater results employing more complex 
techniques of parallelization. 

The second benchmark we considered is an irregular 
application called Ferret. This application is based on 
the Ferret toolkit which is used for content-based simi- 
larity search. It was developed at Princeton University, 
and represents emerging next-generation search en- 
gines for non-text document data types. Ferret is par- 
allelized using the pipeline model and only a Pthreads 
version is provided with PARSEC. However, we had 
access to a TBB version of Ferret [20] which is also 
used in this experiment. 

First, we have observed that the task size in Ferret
is quite small, and would result in high interpretation
overheads by the virtual machine, especially when
using a large number of cores, where the communica- 
tion costs become more apparent. Therefore, we have 
adapted the application to process blocks of five images 
per task, instead of one. 
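
The effect of this grouping on the number of tasks is simple arithmetic, sketched below with hypothetical helper names (task_count, task_size); the image counts used in the usage note are illustrative and not taken from the benchmark input.

```c
#include <assert.h>

/* Number of tasks needed when `n_items` are grouped `per_task` at a time
 * (ceiling division). */
static int task_count(int n_items, int per_task)
{
    assert(n_items >= 0 && per_task > 0);
    return (n_items + per_task - 1) / per_task;
}

/* Number of items in task `task`; the last task may be smaller. */
static int task_size(int task, int n_items, int per_task)
{
    int first = task * per_task;
    assert(task >= 0 && task < task_count(n_items, per_task));
    return (n_items - first < per_task) ? n_items - first : per_task;
}
```

With 23 images and blocks of five, the virtual machine would dispatch 5 tasks instead of 23, the last one holding 3 images, so the interpretation overhead per image drops accordingly.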

[Figure 5 legend: Pthreads, TBB, Treb Manual, Treb Couillard (no WS), Treb Couillard (WS)]
Figure 5. Ferret results.

Our parallel version of Ferret uses a pipeline pattern
where the I/O stages are single super-instructions
and processing stages are parallel. We relied on our
work-stealing mechanism (described in Section 2) to
perform dynamic load balancing. Results presented in
Fig. 5 show that our implementation with work stealing
(Treb Couillard (WS) in the graph) obtains close
to linear speedups, for up to 24 cores, and in fact performs
better than the TBB version, and very close to
the speedups achieved by the Pthreads version. Also,
one can note that work stealing made a significant contribution
to the application performance (speedups for
Treb Couillard (no WS) are lower).

Moreover, we have also prepared a manually fine-tuned
version of Ferret, using over-subscription to rely
on the operating system to perform load balancing. We
run Trebuchet with 3 times more PEs than the number
of used cores and adjust Trebuchet's scheduling
affinity mechanism to use only the cores necessary for
each scenario. Results show that it is possible to outperform
the Pthreads version. Nevertheless, this minor
performance gap between a high-level and a manual 
TALM implementation could be reduced with improve- 
ments on the work stealing mechanism and addition of 
code optimization features on Couillard. 

5. Related Work 

Data-flow is a long-standing idea in the parallel
computing community, with a vast amount of work on 
both pure and hybrid architectures [23l [131 E] • Data- 
flow techniques are widely used in areas such as inter- 
nal computer design and stream processing. Swanson's 
WaveScalar Architecture [53] was an important influ- 
ence in our work, as it was a Data-flow architecture 
but also showed that it is possible to respect sequen- 
tial semantics in the data-flow model, and therefore 
run programs written in imperative languages, such as 
C and C++. The key idea in WaveScalar is to decou- 
ple the execution model from the memory interface, so 
that the memory requests are issued according to the 
program order. To do so, WaveScalar relied on the compiler
to process memory access instructions to guarantee
the program semantics. However, the WaveScalar
approach requires full data-flow hardware, which has
not been achieved in practice.

Threading Building Blocks (TBB) [22] is a C++ library
designed to provide an abstract layer to help programmers
develop multi-threaded code. TBB enables
the programmer to specify parallel tasks, which leads
to higher-level programming than implementing
the code for threads directly. Another feature of TBB
is the use of templates to instantiate mechanisms such 
as pipelines. The templates, however, have limitations. 
For instance, only linear pipelines can be described us- 
ing the pipeline template. 

Another project that relies on code augmentation
for parallelization is DDMCPP [H]. DDMCPP is a
preprocessor for the Data Driven Multithreading model
[17], which, like TALM, is based on dynamic data-flow.

HMPP [10] is "an Heterogeneous Multi-core Parallel
Programming environment that allows the integration
of heterogeneous hardware accelerators in a seamless
intrusive manner while preserving the legacy code". It
provides a runtime environment, a set of compilation
directives and a preprocessor, so that the programmer
can specify portions of accelerator code, called
codelets, that can run on GPGPUs, FPGAs, a remote
machine (using MPI) or the CPU itself. Codelets are
pure functions, without side effects. Multiple codelets
implemented for different hardware can exist, and the
runtime environment will choose which codelet to run
according to hardware availability and previously
specified compilation directives. The runtime
environment is also responsible for the data transfers
to and from the hardware components involved in the
computation.

The Galois System [H [TU [H] is an "object-based
optimistic parallelization system for irregular
applications". It comprises: (i) syntactic constructs
for packaging optimistic parallelism as iteration over
ordered and unordered sets, (ii) a runtime system to
detect unsafe accesses to shared memory and perform the
necessary recovery operations and (iii) assertions about
methods in class libraries. Instead of tracking the
memory addresses accessed by optimistic code, Galois
tracks high-level semantic violations on abstract data
types. For each method that accesses shared memory, the
programmer needs to describe which methods
can be commuted without conflicts (and under which
circumstances). Galois also introduces an alternative
to the commutativity checks, since they may be
costly [TS]. Shared data is partitioned and assigned to
the different processing cores, and the system monitors
whether partitions are being "touched" by concurrent
threads (which would raise a conflict). Regardless of
the detection method used, the programmer needs to
describe, for each method that accesses shared objects,
an inverse method that will be executed in case of
rollback. The runtime system is in charge of detecting
conflicts, calling inverse methods and commanding
re-execution.
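
The role of inverse methods in rollback can be
illustrated with a generic undo log. The sketch below is
not the Galois API: the accumulator type, the undo_log
structure and all function names are hypothetical,
chosen only to show the idea of recording an inverse for
each speculative call and replaying the inverses in
reverse order when a conflict is detected.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_LOG_CAPACITY 64

/* Hypothetical shared abstract data type: an integer accumulator. */
typedef struct { long total; } accumulator;

/* An inverse method undoes one earlier call on the shared object. */
typedef void (*inverse_fn)(accumulator *target, long arg);

typedef struct {
    inverse_fn inverse;
    accumulator *target;
    long arg;
} undo_entry;

typedef struct {
    undo_entry entries[UNDO_LOG_CAPACITY];
    size_t length;
} undo_log;

void undo_log_init(undo_log *log) { log->length = 0; }

/* Inverse of accumulator_add, supplied by the programmer. */
void accumulator_sub(accumulator *acc, long amount) { acc->total -= amount; }

/* Speculative method: applies the update and records its inverse. */
void accumulator_add(accumulator *acc, long amount, undo_log *log)
{
    assert(log->length < UNDO_LOG_CAPACITY);
    acc->total += amount;
    log->entries[log->length++] =
        (undo_entry){ accumulator_sub, acc, amount };
}

/* On conflict: run the recorded inverses, most recent first. */
void undo_log_rollback(undo_log *log)
{
    while (log->length > 0) {
        undo_entry *e = &log->entries[--log->length];
        e->inverse(e->target, e->arg);
    }
}

/* On success: the speculative updates become permanent. */
void undo_log_commit(undo_log *log) { log->length = 0; }
```

A runtime following this scheme would call
undo_log_rollback when it detects a conflict and then
re-execute the aborted work, or call undo_log_commit
when the speculative work completes without conflicts.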

6. Conclusions and Future Work 

We have presented the Couillard compiler, which
compiles an extension of the C language into TALM
code. Initial evaluation on state-of-the-art parallel
applications showed TALM code, generated by Couillard
and running on Trebuchet (a TALM implementation
for multicores), to be competitive with handcrafted
Pthreads and TBB code on up to 24 processors.
Evaluation also shows that we can significantly improve
performance by simply experimenting with the
connectivity and grain of the building blocks, supporting
our claim that Couillard provides a flexible and scalable
framework for parallel computing.

Work on improving Trebuchet continues. Flexible
scheduling is an important requirement in irregular
applications, so we have been working on improving the
work-stealing mechanism of the Trebuchet runtime
environment. Moreover, placement has a strong impact
on application performance and scalability. We are
therefore studying efficient ways to perform automatic
placement on Trebuchet.

We are also working on refining Couillard and on
introducing new features to the support library.
Extending Couillard to allow the use of templates to
describe applications that fit well-known parallel
patterns, and to enable the use of Trebuchet's memory
speculation mechanisms [Hj, are subjects of ongoing
research. This work is based on our experience with
porting actual applications to the framework. Thus,
finding applications that are interesting candidates
for parallelization with Couillard is a constant
research goal.

TALM's super-instructions could also be implemented
for different hardware, using different languages,
as in HMPP [10], as long as there is a way to
call them from our virtual machine. Currently,
super-instructions are compiled as functions in a
dynamically linked library, but an interface to call
GPGPU or FPGA accelerators and perform data transfers
could also be created in our environment. This is the
subject of ongoing work.
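
As an illustration of calling a function that lives in a
dynamically linked library, the sketch below resolves an
entry point with the POSIX dlopen and dlsym calls. It is
an assumption-laden example rather than Trebuchet's
actual loader: the super_instruction_fn signature, the
super_instruction structure and the load and unload
function names are all hypothetical.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical signature: a super-instruction reads its input
 * operands and writes its output operands. */
typedef void (*super_instruction_fn)(void **inputs, void **outputs);

typedef struct {
    void *library;             /* handle from dlopen, NULL if not loaded */
    super_instruction_fn run;  /* entry point resolved with dlsym */
} super_instruction;

/* Loads `symbol` from the shared library at `library_path`.
 * Returns 0 on success and -1 on failure. */
int super_instruction_load(super_instruction *si,
                           const char *library_path, const char *symbol)
{
    const char *msg;

    assert(si != NULL && library_path != NULL && symbol != NULL);
    si->run = NULL;
    si->library = dlopen(library_path, RTLD_NOW);
    if (si->library == NULL) {
        msg = dlerror();
        fprintf(stderr, "dlopen failed: %s\n", msg ? msg : "unknown error");
        return -1;
    }
    /* Cast idiom from the POSIX dlsym documentation for converting
     * the returned object pointer into a function pointer. */
    *(void **)(&si->run) = dlsym(si->library, symbol);
    if (si->run == NULL) {
        msg = dlerror();
        fprintf(stderr, "dlsym failed: %s\n", msg ? msg : "unknown error");
        dlclose(si->library);
        si->library = NULL;
        return -1;
    }
    return 0;
}

/* Releases the library; the entry point must not be called afterwards. */
void super_instruction_unload(super_instruction *si)
{
    if (si->library != NULL) {
        dlclose(si->library);
        si->library = NULL;
    }
    si->run = NULL;
}
```

An accelerator back end could hide behind the same
function-pointer type, with the loaded function
performing the data transfers before and after the
accelerator call. On older C libraries the program may
need to be linked with -ldl.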